Text File | 1992-05-14 | 129KB | 2,719 lines

OCCASIONAL PUBLICATIONS IN ACADEMIC COMPUTING
Number 7

DOCUMENT PREPARATION AIDS FOR NON-MAJOR LANGUAGES

by Andy Black, David Weber, Fred Kuhl, and Kathy Kuhl

Summer Institute of Linguistics, Inc.
Dallas, TX
1987

Occasional Publications in Academic Computing is devoted to publishing computer software and documentation deemed to be of potential usefulness to members of the Summer Institute of Linguistics for carrying out their field projects in linguistics, literacy, anthropology, and translation. The software published in the series may represent work in progress. In publishing this software, the Summer Institute of Linguistics, Inc. is making no commitment to maintenance, but is committed to making full disclosure of source code in cases where maintenance requests cannot be serviced.

EDITOR: Gary F. Simons
ASSISTANT EDITOR: Linda L. Simons

This manual documents the WRDCHG, SYLCHK, SYLCOR, SPLCOR, HYPHEN, and DELIM programs. These programs are written in the C programming language for on-the-field application using personal computers or small time-sharing systems. They run under the RT-11, MS-DOS (including Sharp PC5000), TSX, and UNIX operating systems.

Copyright (c) 1987, by Summer Institute of Linguistics, Inc.

Editorial correspondence or program bugs should be addressed to:

    Academic Computing
    Summer Institute of Linguistics
    7500 West Camp Wisdom Road
    Dallas, TX 75236

Requests for further copies, standing orders, or accompanying software diskettes should be addressed to:

    Bookstore
    Summer Institute of Linguistics
    7500 West Camp Wisdom Road
    Dallas, TX 75236

CONTENTS

1. INTRODUCTION
   1.1 Overview of program functions
   1.2 Overview of program structure
   1.3 Some lessons from history
2. WORD CHANGE (WRDCHG)
   2.1 Introduction
   2.2 Making a change table
   2.3 The default mode
   2.4 Making a standard format marker field file
   2.5 Running the program
3. SYLLABLE-BASED SPELLING CHECKING (SYLCHK)
   3.1 Introduction
   3.2 Running the program
   3.3 The form of the output
   3.4 How to write the ONC file
   3.5 How to write an orthography change table
4. SYLLABLE-BASED SPELLING CORRECTION (SYLCOR)
   4.1 Introduction
   4.2 Initiating a session with SYLCOR
   4.3 Screen layout
   4.4 Handling possible errors: word edit mode
   4.5 Making the auto-correction and exceptions files
   4.6 Ending a session with SYLCOR
   4.7 Writing your own auto-correction and exception files
5. SPELLING CORRECTION WITH TABLE LOOKUP (SPLCOR)
6. HYPHENATION (HYPHEN)
   6.1 Introduction
   6.2 Data files
   6.3 Running the program
   6.4 Examples
   6.5 Miscellaneous
7. DELIMITER CHECKING AND NESTING CHECK (DELIM)
   7.1 Introduction
   7.2 Running the program
   7.3 The form of the output
   7.4 How to write a delimiter file
   7.5 Program limitations


1. INTRODUCTION

The programs described in this booklet are aids to producing documents. They are useful for a wide range of languages. Each arose in response to a need felt by field linguists involved in producing documents in non-English languages.

1.1 Overview of program functions

WRDCHG makes changes to the words of a text, while preserving capitalization, punctuation and formatting. It is useful for correcting spelling and typographic errors, or even for adapting text between closely related dialects. It is simple to use, it allows for conditioning in terms of word boundaries, and it is efficient when hundreds of changes are involved because it stores the changes in a dense form and because it is fast.

SYLCHK identifies potential spelling errors in text, using decomposition into syllables as the method for identifying possible errors, and returns these as a list. The user supplies information about the syllable structure of the language.

SYLCHK and WRDCHG work together to correct many spelling errors.
SYLCHK is first run on the text to collect potential errors. This list is then (optionally) sorted and duplicates are eliminated, and then it is edited to make a list of changes. These changes are then made to the text with WRDCHG. However, this method has a weakness: without context, the user may not know how to correct some errors. For example, if the error were ther, one would not know whether it should be corrected to the, their, there, other, or something else. This sort of case motivated the next program.

SYLCOR is an interactive editor for correcting potential errors. SYLCOR identifies potential errors by the syllable decomposition algorithm used in SYLCHK, using the same data files as SYLCHK. When a potential error is found, it is displayed in the upper portion of the screen with the surrounding text and in a work area in the lower portion of the screen, where it can be corrected. If the word is modified, the user may make the change an automatic correction. If it is not modified, the user may add it to one of various lists of exceptions (for example, names, loan words, acronyms, and so on).

SPLCOR is like SYLCOR except that, rather than using syllable decomposition for detecting errors, it assumes that a word is an error unless it is found on one of the exceptions lists. This may be a useful approach for languages where the writing system or the phonology (or both!) make syllable decomposition ineffectual as an error detection algorithm. The user simply accumulates a list of all words which are to be passed without further attention.

This brings up an interesting question: What are some other useful error detection methods? Syllable decomposition has proven to be useful in many languages, particularly where syllables are fairly restricted and the writing system represents the phonology closely. But it will not yield the same results for every language; for example, it is less effective for Spanish than it is for Quechua.
Another possibility is morphological parsing, i.e. decomposition into morphemes rather than into syllables. For Quechua, morphological parsing is a more effective method than syllable decomposition, but it is also more costly in terms of the complexity of the program, the data which must be provided by the user, and the data which must be loaded each time the program is run.

There are other schemes that have been used for other languages. One algorithm for English passes a three-character window over the word, looking up the probability for the occurrence of each character triple in a table. (These probabilities are established by running the program in a training mode on large portions of correct text.) The word is rejected or passed as a function of the probabilities of its character triples.

I leave the following question with the reader: for the language to which you wish to apply spelling error detection, what would be the best method of detecting possible errors? If you come up with a new idea, perhaps we can prepare alternative programs which are like SYLCOR and SPLCOR, but which have different error detection algorithms. SPLCOR provides the skeleton into which other algorithms for error detection -- ones that you devise -- could be inserted; the program source code is available for those who wish to give it a try.

HYPHEN introduces a user-determined character at syllable boundaries. This can be used as a "discretionary hyphen" for formatting with a program like Manuscripter. The user provides data in terms of which the program recognizes syllable boundaries. The user can control how close to the word boundaries the discretionary hyphen may occur, so as to avoid stranding parts of words which are too small.

DELIM checks text to see that delimiters (characters like quote marks, brackets, braces, parentheses, and so on) are paired and properly nested.
This is useful for technical papers and for computer programs, both of which often contain a great many delimiters. The user has control over what DELIM regards as an opening delimiter character and what is the corresponding closing delimiter. DELIM reports errors by giving the line number and the line, and by indicating the offending delimiter.

1.2 Overview of program structure

WRDCHG, SYLCOR, SPLCOR, and HYPHEN share the same basic program structure, as proposed in Weber and Kasper "Getting at the Words in Text," Notes on Linguistics 2:17-22 (1983). The module which performs the particular action on a word is lodged between a module TXTIN, which separates the word from other characteristics of the text (capitalization, punctuation, formatting), and a module TXTOUT, which recomposes the text with the possibly-modified word in place of the original word. See the following diagram:

                       +--------+
        words -------- | ACTION | ---- (modified) words
          |            +--------+            |
          |                                  |
     +---------+   punct,capit,format   +---------+
     |  TXTIN  | ---------------------- | TXTOUT  |
     +---------+                        +---------+
          |                                  |
      input text                        output text

(SYLCHK uses the TXTIN module, but since it does not produce an output text, it does not use TXTOUT.)

Because these programs share this structure, they share a lot of code, facilitating both development and maintenance. I suspect that other, future programs could benefit from this architecture, and perhaps even reuse the TXTIN and TXTOUT modules.

1.3 Some lessons from history

A bit of history is in order, particularly since it is instructive as to how programs such as these can arise in response to needs felt by field linguists.
My involvement in the development of these programs (exclusive of HYPHEN) has been to see the need for a program, to get an approximate conceptualization of the program, to write out some elementary design, to interact with the implementors (answering questions about how I think it should work, providing test data, and so on), and to help write documentation. The programming expertise was virtually all contributed by volunteers.

The first volunteer was Bob Kasper. Bob came to Peru upon finishing his B.S. at Cornell University to implement the Computer Assisted Dialect Adaptation program. As part of this he wrote the TXTIN and TXTOUT functions. The CADA program required a change module, so after that was developed, I suggested that Bob make the WRDCHG program by putting that module between TXTIN and TXTOUT. Since all the pieces were there, it was not a major job, and the first version of WRDCHG was born. About the same time, I began learning the C programming language, and wrote the first version of SYLCHK and DELIM with Bob's help.

During Bob's stay in Peru, Alex Waibel (who worked in speech research at Carnegie-Mellon University) came to Peru for a two week "working" vacation. Bob and I had a design document ready for Alex, and about a week and a half after arriving, Alex had a working editor, called CADAED, for application to CADA output text.

About two years later, Fred and Kathy Kuhl came to Peru for a six week period. Fred had just finished his doctorate in Computer Science and Kathy had taken several courses in programming. I had written a design of SYLCOR based on my experience with a spelling corrector on another system, and on Bob's TXTIN and TXTOUT, my SYLCHK, and Alex's CADAED. I also had some ideas for how WRDCHG, SYLCHK and DELIM could be improved. Fred and Kathy went right to work, Fred on WRDCHG and SYLCOR, and Kathy on SYLCHK and DELIM. When Fred and Kathy left six weeks later, the programs were as they now are.
SYLCOR incorporates work which Bob, Alex, Kathy and I did, combined masterfully by Fred. Thus, for me, SYLCOR is a monument to cooperation, volunteerism, and professionalism. Bob, Alex, Kathy, and Fred contributed their skills, writing code which others could build upon or building on the work of the former. My role was simply to orchestrate this development.

My experience with these programs has confirmed something I first learned by working with Bill Mann: that "linguistic" software is probably best developed as a collaboration between the linguist and the computer professionals. The linguist must identify the problem(s) for which software is needed, conceptualize a program (which must be computationally tractable), and then communicate this to the computer professional, whose responsibility is to refine the linguist's conceptualization and produce the code. And computer professionals who are willing to go to the field (to where the linguist confronts the situation for which he feels the need for a program) can make a large contribution, even if they only stay a short while.

The development of the HYPHEN program suggests another lesson. HYPHEN was written by Andy Black in response to an obvious need to introduce discretionary hyphens for the text formatting demands in the SIL computer center he manages. Andy could have started from scratch and written the program entirely himself. But, being familiar with the architecture and code used for WRDCHG, he used TXTIN and TXTOUT. This accelerated his development effort, and will save program maintenance time in the future. Andy's example makes me optimistic about the development of other programs -- as yet unanticipated -- which can be built without exorbitant effort from program parts which are already in hand.
If we can make our software development cooperative in this way, each building as much as possible on the work of others rather than starting from scratch for every program, and if, as discussed earlier, we can bring together the linguist and the computer professional, then perhaps we might be able to fulfill -- to a large measure -- our need for linguistic software.

There are other people whose names do not appear as authors but who have contributed considerable effort in bringing this publication to reality. Steve McConnel ported the programs to the other operating systems and in doing so cleaned up several inconsistencies within and between the programs. Gary Simons provided general editorial advice and offered suggestions to make the programs more general so they could be used in language families quite different from the one they were originally designed to work for. Linda Simons tested the ported versions along with Steve and took the documentation through several updates to keep it in line with the program improvements.


2. WORD CHANGE (WRDCHG)

2.1 Introduction

Word Change (WRDCHG) passes over a text, changing words as specified by the user in a change table. WRDCHG can only change words; it cannot change punctuation, format marking or capitalization. (Each output word will have the capitalization of the corresponding input word.) It is possible to condition changes as applying only at word boundaries. The speed of application is not substantially affected by the number of changes in the change table; a large number (perhaps as many as 1500) can be made quickly. It also can apply the changes only to specified standard format fields. This gives the ability to make changes to only the vernacular entries of a dictionary, for example.

2.2 Making a change table

A change table is a list of paired strings, each string bounded by double quotes (").
The first string of a pair is called the "match string"; it specifies some pattern to be matched in a text. The second string, called the "substitution string," specifies what is to be substituted for each occurrence of the matched string. Observe the following in writing a change table:

1. The changes in a table may occur in any order (i.e., the order in which changes occur in a table makes no difference in the effect upon any text). Therefore changes cannot be "ordered." That is, a second change dependent upon a condition created by a first change will not work. For example, if the following two changes are in a table, only the first will occur since the program will not scan the input text a second time to find "bi?u".

    "'"     "?"
    "bi?u"  "bi?o"

2. All changes should be given in lower case. It is not necessary to give a change with various capitalizations, as the result of any change will be capitalized just as the original word. For example, the change

    "yeild"  "yield"

will change "yeild" to "yield", "Yeild" to "Yield" and "YEILD" to "YIELD". (WRDCHG recognizes only three possibilities: all lower case, all upper case, first character capitalized.)

3. If a character (other than space or tab) appears on a line before the first double quote mark, then that line is regarded as a comment, and any change on that line is not applied. This provides a simple mechanism for disabling a change: simply put some character ahead of the first string. For example, the following line would not make any change:

    off "this" "that"

4. Any character(s) may be placed between the left and right strings. This allows whatever notation you like to symbolize the change. The following lines have the same effect:

    "mispelled" becomes "misspelled"
    "mispelled" --> "misspelled"
    "mispelled" > "misspelled"
    "mispelled" "misspelled"

5. Anything following the right string is ignored, so comments may follow the pair of strings; for example, the following three changes are effective:

    "kachaka"  "alliya"   `get well'
    "qo"       "qara"     `give'
    "fiyupa"   "aliska"   `very much'

6. Changes may be specified as applying (a) only at the beginning of a word, (b) only at the end of a word, or (c) only if the complete word is matched.

To specify that a change applies only at the beginning of a word, include a space between the leading double quote and the first character of the match string; for example, the following change affects only the first "ka" in "kaykan":

    " ka"  "ke"

To specify that a change applies only at the end of a word, include a space between the final character of the match string and the following double quote; for example, the following change affects only the last "na" of "nakananpaqna":

    "na "  "nya"

To specify that a change applies only when the complete word is matched, include spaces both at the beginning and end of the match string; for example, the following changes the word "na" when it stands alone, but would not make any change to "nakananpaqna":

    " na "  "nya"

7. A change table may have multiple changes whose match string has the same character string but which differ in terms of boundary conditions. The order of priority for application of changes whose match strings are the same except for boundary conditions is 3 > 2 > 1 > 0 where

    (0) anywhere within a word
    (1) only at the end of a word
    (2) only at the beginning of a word
    (3) only when the entire word is matched

That is, 3 applies in preference to 0-2, 2 applies in preference to 1 and 0, and 1 applies in preference to 0. (A way to think of this is that the change with the most restricted conditions is applied in preference to a change with a less restricted condition.)
For example, consider Change Table I:

    TABLE I
    "na"    "naa"    (0) anywhere
    "na "   "nac"    (1) only at the end
    " na"   "nab"    (2) only at the beginning
    " na "  "nad"    (3) if complete word

Change Table I changes "Nakamaananpaqna" to "Nabkamaanaanpaqnac". The first instance of "na" is changed to "nab" because the change with the "word-initial" condition (2) applies in preference to the change with the "anywhere" condition (0). Likewise, the last instance of "na" becomes "nac" by the change with the "word-final" condition (1) because it applies in preference to the change with the "anywhere" condition (0). The second instance of "na" is changed by the "anywhere" change because that is the only change whose conditions are met.

Change Table I changes the isolated word "na" to "nad". In this case, all of the changes are, in principle, applicable, but the one with the "complete word" condition (3) applies in preference to the others (0-2).

Further, consider Change Table II:

    TABLE II
    "na "   "nac"    (1) only at the end
    " na"   "nab"    (2) only at the beginning

In this case the change which applies only at the beginning of a word (2) applies in preference to the change which applies at the end of the word (1).

If the same match string (including boundary conditions) occurs in more than one change in a table, the last given will prevail. Thus, if a table contained the following lines, "number" would be changed to "last".

    "number"  "first"
    "number"  "last"

8. When one change table gives a substitution string for "a" and also for "ab", the "ab" change will be made wherever "ab" occurs, but the "a" change will not also be made there. For instance, in the table

    "'"   "?"
    "'u"  "'o"

all occurrences of "'u" will be changed to "'o" but will not be changed to "?u". All other occurrences of "'" will go to "?". To solve this problem, the second line of the change table should read:

    "'u"  "?o"

2.3 The default mode

In many cases, virtually all the changes in a table will have the same condition.
For example, suppose that you are working in a language which does not have prefixes, and you wish to make a number of changes to roots. It would be possible to ensure that the changes apply only to roots by including a space at the beginning of each match string. However, this has been made unnecessary by providing the appropriate "default mode" at the time of running the program. WRDCHG gives the following prompt:

    Should changes be made
      (0) anywhere within a word
      (1) only at the end of a word
      (2) only at the beginning of a word
      (3) only when the entire word is matched
    Type 0, 1, 2, or 3 :

The effect of answering 0 (or RETURN) is that all changes will occur exactly as you have specified them in the change table, including the leading and/or following spaces you have included. The effect of answering "1" is as though a space were included at the end of each match string; the effect of answering "2" is as though a space were included at the beginning of each match string; and the effect of answering "3" is as though spaces were added at the beginning and end of the match string.

Note that the appropriate response can make it unnecessary (though not incorrect) to include a space in the actual change table. For example, Change Table III applied with default mode "3" is equivalent to applying Change Table IV:

    TABLE III          TABLE IV
    "yee"  "yey"       " yee "  "yey"
    "kyo"  "kiw"       " kyo "  "kiw"
    "pok"  "puk"       " pok "  "puk"

2.4 Making a standard format marker field file

This file gives you the ability to pick and choose which parts of a standard format file the changes are to apply to. To do this, merely create a file listing the markers indicating the desired fields. If you want all fields or if the file does not contain standard format data, then this file should be empty. The layout of this file is very free.
Thus the following are all equivalent:

    (1) The following markers indicate which fields are to be changed: \w \i
    (2) \w \i
    (3) \w\i

2.5 Running the program

When WRDCHG starts it prints the following:

    WORD CHANGE Version 2.3 (12-Dec-86)

You are first informed of how much memory is available for a change table by a message like the following:

    SETUP-ALLOC 22832 bytes for records

You are then asked to indicate characters which you wish to have treated as alphabetic characters along with the standard ones. Note that all other characters will be regarded as occurring outside of words. For example, if one wished to change "didn't" to "did not", the apostrophe (') would have to be treated as an alphabetic character; otherwise WRDCHG will treat "didn't" as two words, "didn" and "t".

    Type RETURN to include these as alphabetic characters: ~'
    Otherwise type the characters desired:

After you respond, WRDCHG will inform you of the characters it will treat as alphabetic. For example, if you responded by typing a tilde (~), you will then see the following:

    Using the following as alphabetics:
    ~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Before you are asked for the name of the change table, WRDCHG needs to know two things, the "trie level" and the "default mode." We will now discuss each of these in turn.

The change table is stored in the computer's memory as a type of tree structure, called a "trie." Tries are more efficient than simple lists in two ways: (a) it is possible to find entries much more quickly, and (b) for large tables, more changes can be stored. The degree to which this efficiency is attempted is set by the number you give in response to the prompt:

    Maximum number of levels in the trie: [99]

If there were nothing to pay for the efficiency, one would simply strive for the maximum, responding always with a carriage return. But that is not the case.
If the change table is not large enough to take advantage of the density you hope to achieve, more space is used than necessary. (It's something like packaging soap in economy sized boxes: if you don't fill them the result takes up more room than necessary.) As a rule of thumb, use 2 or 3 for tables with up to 1000 entries. You will probably develop a feel for what number is appropriate; you might even experiment, loading the same table with different numbers and seeing which number leaves the most space (as reported by messages concerning free space given before and after the change table is loaded). By the way, if you set the number too low (say 0 or 1 for over 500 entries) the time it takes to find each change will increase considerably.

Next you will be asked about the "default mode" by the following prompt:

    Changes should be made:
      0) anywhere within a word
      1) only at the end of a word
      2) only at the beginning of a word
      3) only when the entire word is matched
    Type 0, 1, 2, or 3 : [0]

This has been discussed above in section 2.3.

Now that you have provided the "trie level" and the "default mode," WRDCHG is prepared to load a change table. It asks for it with the following prompt:

    Change table file:

When it is finished loading, it informs you of the number of changes loaded and the amount of storage left. For instance,

    235 changes loaded.
    24733 bytes left, largest space is 14733 bytes.

Now you are asked for the name of the file that indicates which specific standard format fields the changes apply to. This is done by the following prompt:

    Standard format marker field file: (<RETURN> for all fields)

See section 2.4 for a discussion of this file. If you want all fields, then merely press the <RETURN> key.

You are next asked for the name of the file to be changed:

    Input file:

You are also asked to give a name for the output file (i.e., the changed file). WRDCHG makes up a default file name which you can use by simply responding with a carriage return.
For example, if your input file name is abcdef.sfm, then the prompt for an output file will appear as:

    Output file [abcdef.chg]:

and by simply typing a carriage return you can create the output file on the default device with the name abcdef.chg.

After the file is processed, you will be informed at the terminal of the number of words which were read and the number which were altered with a message like the following:

    INPUT: 234 words
    234 words read, 7 altered.

WRDCHG allows multiple input files (all to be processed with the same change file, the same trie level, and the same default mode). You are asked:

    Next input file: (<RETURN> if no more)

If you respond with a file name, you will be asked for an output file name as before, and that file will be processed. If you respond with a carriage return, you terminate WRDCHG and return to the monitor.


3. SYLLABLE-BASED SPELLING CHECKING (SYLCHK)

3.1 Introduction

SYLCHK identifies possible typographical errors and misspellings in texts by judging the phonological well-formedness of each word: a word is a possible error if it cannot be decomposed into one or more well-formed syllables. SYLCHK assumes that a syllable is made up of an optional onset, a vocalic nucleus, and an optional coda; the user must supply a table of these for the language to which he is applying SYLCHK. (Obviously SYLCHK cannot be applied in a language whose writing system does not approximate phonological form.)

SYLCHK never alters the text to which it is applied. However, it may be used to correct text files in the following way:

1. SYLCHK is applied to one or more text files, accumulating the possible errors in a single output file.

2. This error file is sorted and edited to create a change table for correcting the errors.

3. The change table is applied to the text files with a program like WRDCHG (in this package) or CC (Consistent Changes).
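
Step 2 of this procedure can be partly mechanized. The following is a minimal sketch in Python, not part of the package, of how an error file in the layout described in section 3.3 might be turned into the skeleton of a change table. The function name change_table_skeleton is hypothetical, and the entries for the second file (HGMK02.SFM) are invented for illustration. Each possible error becomes a whole-word change (rule 6 of section 2.2) in lower case (rule 2), with the substitution string left identical to the match string for the user to edit by hand.

```python
import re


def change_table_skeleton(report_lines):
    """Turn SYLCHK-style report lines into whole-word change table lines.

    Assumes the report layout of section 3.3: a "Possible errors in <file>"
    header, one word per line, and an "<n> possible errors" trailer.
    """
    words = set()
    for line in report_lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith("Possible errors in "):
            continue                      # per-file header
        if re.fullmatch(r"\d+ possible errors?", text):
            continue                      # per-file count
        words.add(text.lower())           # change tables are lower case
    # Spaces inside the match string request a whole-word match.
    # The right-hand string is only a placeholder to be corrected by hand.
    return ['" {0} "  "{0}"'.format(w) for w in sorted(words)]


report = [
    "Possible errors in HGMK01.SFM",
    "akrarkran",
    "hanunn",
    "wais",
    "3 possible errors",
    "Possible errors in HGMK02.SFM",      # hypothetical second file
    "wais",
    "Hanunn",
    "2 possible errors",
]

for entry in change_table_skeleton(report):
    print(entry)
```

Sorting and removing duplicates is the easy part; deciding what each word should become still requires a person, and for words that cannot be corrected without context the interactive SYLCOR program is the better tool.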
3.2 Running the program

When SYLCHK is run the following will appear on the screen:

    SYLLABLE BASED SPELLING CHECK Version 3.0 (15-Dec-86)

You are then informed of how much memory is available by a message like the following:

    SETUP-ALLOC-10904 bytes for records

You are then asked to indicate characters which you wish to have treated as alphabetic characters along with the standard ones. Note that all other characters will be regarded as occurring outside of words. For example, for "didn't" to be checked as a single word, the apostrophe (') would have to be treated as an alphabetic character; otherwise SYLCHK will treat "didn't" as two words, "didn" and "t".

    Press <RETURN> to include these as alphabetic characters: ~'
    Otherwise type the characters desired:

After you respond, SYLCHK will inform you of the characters it will treat as alphabetic. For example, if you responded by typing a tilde (~), you will then see the following:

    Using the following as alphabetics:
    ~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

The next thing SYLCHK does is ask for two things relating to the ONC or Onset-Nucleus-Coda file. Details about the form of this file are given below in section 3.4. The ONC file tells the program which characters (and character sequences) are allowed to form correct syllables. First it asks for the character you have used to separate ONC distribution classes in the ONC file, and then asks for the name of this file:

    Character which separates ONC distribution classes: [\]
    ONC file:

You will then be asked for an orthography change table. If you do not want to use a change table, simply press <RETURN>. An orthography change table allows you to normalize the spelling of words before they are checked, which may be very useful. For example, in the practical orthography for Quechua, long vowels are represented as two vowels (e.g. long /a/ is represented as "aa").
However, in the phonological system, long vowels pattern as a vowel followed by a consonant, so long /a/ patterns as an /a/ followed by a consonant [length]. (For a justification of this analysis, see David Weber and Peter Landerman "The Interpretation of Long Vowels in Quechua" IJAL, January 1985, pages 94-108.) In order that SYLCHK treat long vowels in this way, the words are normalized by changing "aa" to "a:"; "ee" to "e:"; and so on, and ":" is listed as a coda in the ONC file. The format of an orthography change table is described below in section 3.5. If an orthography change table is specified, the program will respond with a message:

    Orthography change table file: [None]
    5 changes loaded.

Now you are asked for the name of the file that indicates which specific standard format fields the program will check. This is done with the following prompt:

    Standard format marker field file: (<RETURN> for all fields)

See section 2.4 for a discussion of this file. If you want all fields, simply press the <RETURN> key.

Next you will be asked for an output file:

    Output file: [con]

If you simply type a carriage return, the list of possible misspelled words will be displayed on the terminal. If you type the appropriate device name to refer to your printer it will be printed (without creating a file). If you type a file name, the result will be written to that file.

Next, you are asked for the file to be checked with the prompt

    Input file:

The program will then begin processing the text. Each time SYLCHK successfully decomposes a word into well-formed syllables, a period will appear on the screen, enabling you to watch its rate of progress. At the end you will see a summary like:

    INPUT: 386 words.
    73 possible errors in abcdef.ghi

SYLCHK allows multiple input files to be checked (with the same ONC specifications, etc.).
You are asked:

    Next input file (RETURN if no more):

If you respond with a <RETURN> the program will terminate and you will return to the monitor.

3.3 The form of the output

The output file will contain, for each file being checked, its name, the potential errors found in that file (with each possibly misspelled word on a separate line), and following the last possible error, the number of possible errors found in that file.

    Possible errors in HGMK01.SFM
    akrarkran
    hanunn
    wais
    3 possible errors

3.4 How to write the ONC file

This file informs SYLCHK of the characters and character strings that are acceptable syllable onsets, nuclei and codas. These appear in five sets, corresponding to the following distribution classes:

    first  = only in syllable onset (e.g., kw, sy, n~)
    second = only in syllable coda (e.g., length)
    third  = in either the onset or coda; if ambiguous, will be interpreted as onset (e.g., k, ch)
    fourth = in either the coda or onset; if ambiguous, will be interpreted as coda
    fifth  = in the vocalic nucleus (e.g., a, e, i, o, u)

Members of each set are mutually exclusive of all other sets, that is, no phoneme can occur in more than one distribution class. The third and fourth classes are listed as they are to solve the problem of ambiguity: how does one divide words that are of the CVCVC pattern? In the third set, onset or coda, phonemes are listed that can occur as either onsets or codae. If a member of this set occurs as the middle C in a CVCVC pattern, the program will interpret it as an onset, that is, CV.CVC. Likewise, phonemes listed in the fourth set, coda or onset, will be interpreted as a coda if they occur as the middle C in a CVCVC pattern, that is CVC.VC.

The beginning and ending of each class is marked by a "\" (backslash). (Thus, the file should contain 10 \'s.)
Any characters outside of these five regions are treated as comment (i.e., everything before the first "\", between the second and third, the fourth and fifth, the sixth and seventh, the eighth and ninth, or following after the last "\" is comment.) Within each class, characters and character strings should be separated by one or more whitespace characters (tab, blank or carriage return).

The ONC file also tells SYLCHK what are acceptable syllable patterns within words. Three patterns are given. The first describes only initial syllables, the third describes only final syllables, and the second describes all medial syllables. The parentheses are used to indicate a syllable; the square brackets indicate an optional phoneme. Be certain there is no matching parenthesis or square bracket missing.

Here is a sample ONC file (used for Quechua):

    NJSYL.ONC modified for SYLCHK v. 3. by Steve McConnel, 12-Dec-86
    ONSET ONLY \ dy br pr by b d dr f fw fy gy h hy kl j hw ky kw py pw rr sy ty n~ kr bl n~w \
    CODA ONLY \ : \
    ONSET OR CODA \ ch g k l ll m n p q r s sh t tr ts w y \
    CODA OR ONSET \ \
    NUCLEUS \ a e i o u a' e' i' o' u' \
    SYLLABLE PATTERNS ([O]N[C]) (ON[C]) (ON[C])

Here is a second example of an ONC file, describing the syllable pattern for To'abaita (Solomon Islands), where the only two syllable shapes are V and CV.

    ONC.TOB by Linda Simons December 1986
    ONSET ONLY \ b d f g gw k kw ng ' l m n r s t th w \
    CODA ONLY \ \
    ONSET OR CODA \ \ considered onset if ambiguous
    CODA OR ONSET \ \ considered coda if ambiguous
    NUCLEUS \ a e i o u \
    SYLLABLE PATTERN ([O]N) ([O]N) ([O]N)

You should not be unduly concerned about making this table complete. Create a first approximation with those characters that come to mind, and try it out on a text. It will then quickly become obvious which characters and character strings you need to add to the table.
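The layout rules of the ONC file can be illustrated with a short sketch. The following Python fragment is not part of SYLCHK (which is written in C); it is only a minimal illustration, assuming the default "\" separator. The names parse_onc and is_well_formed are hypothetical, and the well-formedness test handles only the simple case in which every syllable is an optional onset followed by a nucleus, as in the To'abaita sample. How SYLCHK itself reads the three syllable patterns is not specified here, so the sketch merely collects them from the text after the last separator.

```python
import re

# Names for the five distribution classes, in the order the manual lists them.
CLASS_NAMES = ("onset_only", "coda_only", "onset_or_coda",
               "coda_or_onset", "nucleus")

TOB_SAMPLE = r"""ONC.TOB by Linda Simons December 1986
ONSET ONLY \ b d f g gw k kw ng ' l m n r s t th w \
CODA ONLY \ \
ONSET OR CODA \ \ considered onset if ambiguous
CODA OR ONSET \ \ considered coda if ambiguous
NUCLEUS \ a e i o u \
SYLLABLE PATTERN ([O]N) ([O]N) ([O]N)
"""


def parse_onc(text, sep="\\"):
    """Split an ONC file into its five classes and its syllable patterns."""
    parts = text.split(sep)
    if len(parts) != 11:  # 10 separators give 11 pieces
        raise ValueError("expected 10 separators, found %d" % (len(parts) - 1))
    # Odd-numbered pieces are the classes; even-numbered pieces are comment.
    classes = {name: parts[2 * i + 1].split()
               for i, name in enumerate(CLASS_NAMES)}
    # No unit may occur in more than one distribution class.
    owner = {}
    for name, units in classes.items():
        for unit in units:
            if unit in owner:
                raise ValueError("%r is listed more than once" % unit)
            owner[unit] = name
    # Assumption: the patterns are the parenthesized groups after the last separator.
    patterns = re.findall(r"\([^()]*\)", parts[10])
    return classes, patterns


def is_well_formed(word, classes):
    """True if word is a sequence of ([O]N) syllables (simplified check)."""
    def alternatives(units):
        # Longer units first, so that "kw" is tried before "k".
        return "|".join(re.escape(u)
                        for u in sorted(units, key=len, reverse=True))
    onsets = (classes["onset_only"] + classes["onset_or_coda"]
              + classes["coda_or_onset"])
    syllable = "(?:%s)?(?:%s)" % (alternatives(onsets),
                                  alternatives(classes["nucleus"]))
    return re.fullmatch("(?:%s)+" % syllable, word) is not None


classes, patterns = parse_onc(TOB_SAMPLE)
print(classes["nucleus"], patterns)
print(is_well_formed("to'abaita", classes))
```

A file with a missing backslash is rejected at once by the separator count, which is the same check suggested by the remark that the file should contain 10 \'s.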
3.5 How to write an orthography change table

An orthography change table is a list of paired strings, each string bounded by double quotes ("). The first string of a pair specifies some pattern to be matched in a text, and the second string specifies what is to be substituted for each occurrence of the matched string. For example, the following is the table used for Quechua mentioned above:

    LNGVWL.TAB D. Weber May-30-82
    "aa" "a:"
    "ee" "e:"
    "ii" "i:"
    "oo" "o:"
    "uu" "u:"

Observe the following in writing a change table:

1. The changes may occur in any order, that is, their order makes no difference in the effect on the text.

2. All changes should be given in lower case; it is not necessary to give a change with various capitalizations.

3. Any line whose first printing character is not a double quote is treated as a comment. (Note that a space or tab may precede an effective change, since these are not printing characters.)

4. Any characters may be placed between the left and right strings. This allows whatever notation you like to symbolize the change; the following lines have the same effect:

    "mispelled" becomes "misspelled"
    "mispelled" --> "misspelled"
    "mispelled" > "misspelled"
    "mispelled" "misspelled"

5. Anything following the right string is ignored, so comments may follow the pair of strings; for example, the following three changes are effective:

    "kachaka" "alliya" `get well'
    "qo" "qara" `give'
    "fiyupa" "aliska" `very much'

4. SYLLABLE-BASED SPELLING CORRECTION (SYLCOR)

4.1 Introduction

SYLCOR is a program for correcting misspellings and typographical errors in text. SYLCOR identifies possible errors by judging the phonological well-formedness of each word: a word possibly has an error if it cannot be decomposed into one or more well-formed syllables. SYLCOR assumes that a syllable is made up of an optional onset, a vocalic nucleus, and an optional coda; the user must supply a table of these for the language to which he is applying SYLCOR.
(SYLCOR cannot be applied in a language whose writing system does not approximate phonological form, for example, Chinese.)

Potential errors in text may be exceptions to whatever method is used to discover them. For example, if error detection for the Quechua language is based on phonological well-formedness, then many words borrowed from Spanish are exceptions. SYLCOR uses lists (which you create as you correct text) to skip such exceptional words. You might have a list of loan words, a list of abbreviations, a list of Biblical names, or something else.

Potential errors in text may be real errors. SYLCOR allows these to be corrected. Context is sometimes needed to determine what the correct word should be. For example, if you were to encounter the misspelling "ther" out of context, you would not know whether it should be corrected to "their", "there", "other", "the", etc. Therefore, each time an error is suspected, SYLCOR displays a region of text surrounding the suspect word.

For many errors, you will simply want to correct the error and continue through the text. For common errors, you may want to have all subsequent instances corrected automatically. For example, you might want all instances of "recieve" to become "receive" automatically. SYLCOR allows you to create (in the process of correcting text) lists of automatic changes. You may choose to have each automatic correction presented for your approval before it modifies the text.

When you begin a session with SYLCOR, the files containing exceptions and auto-corrections are loaded. At the end of each text corrected, for each file to which there have been additions, you are asked if you would like to update the file or back up the additions. In this way, the files may be enlarged by each session, and consequently you do less and less work in subsequent sessions.

SYLCOR may be applied to many texts in one session.
For each input text file, a corresponding output file will be created. SYLCOR deals only with the words of the text, and deals with them only one at a time. All the format marking, punctuation and capitalization are passed unchanged from the input text to the output text.

As mentioned above, SYLCOR uses phonological well-formedness for detecting potential errors. SYLCOR's error detector is precisely that of SYLCHK. Both use the same data files, i.e. the same orthography normalization table and the same file of acceptable onsets, nuclei and codae. Before running SYLCOR, you might find it helpful to run SYLCHK on some text; this will help you to develop the data you need in the tables.

If you intend to put words into a new auto-corrections or exceptions file during a SYLCOR session, you must create these files before you run SYLCOR. The files may be empty, but you are encouraged to place identifying comments in them, according to the syntax given below (see section 4.7).

4.2 Initiating a session with SYLCOR

After giving the command to run SYLCOR you will first see a line showing you the amount of available memory. Then you must respond to some questions so that some files can be loaded and so that certain options may be set. You are first asked for a setup file with the prompt:

    Setup file: [none]

If you do not have a setup file, you must answer a series of questions interactively at the terminal. If you provide the name of a setup file, many of the subsequent questions will be answered from the file, and you will be free to seek the beverage of your choice while the files load. The following is a sample setup file:

    Setup file for using SYLCOR with To'abaita texts
    '
    2
    1
    autoco.tob
    y
    loan.tob
    biblic.tob

    \
    onc.tob
    fields.tob

The first line will always be skipped; this allows you to provide an identifying comment. Subsequent lines provide responses to the questions in the order the program asks them as discussed below.
There may be from zero to four names of exceptions lists, and after the last exceptions file is given there must be a carriage return. If some file cannot be found, setting up becomes interactive, and you must provide the correct responses from the terminal (unless you want to abort SYLCOR, edit the setup file and try again).

After being asked for a setup file, you will then be asked which characters you want treated as alphabetic characters in addition to the standard ones:

    Press <RETURN> to include these as alphabetic characters: ~'
    Otherwise type the characters desired:

All other characters will be regarded as occurring outside of words. For example, if you wish to treat "oyo't" as a word, include the apostrophe (') as an alphabetic character; otherwise SYLCOR will treat "oyo't" as the two words "oyo" and "t".

After you respond, SYLCOR will inform you of the characters it is treating as alphabetic. For example, if you responded by typing a tilde (~) and an apostrophe ('), you will then see the following:

    Using the following as alphabetics:
    ~'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

The auto-corrections and exceptions you are about to be asked for are stored in the computer's memory as a type of tree structure, called a "trie." Tries are more efficient than simple lists in two ways: (i) it is possible to find entries much more quickly, and (ii) for large tables, more changes can be stored. The degree to which this efficiency is attempted is set by the number you give in response to the prompt:

    Maximum level for trie [no limit]?

If there were nothing to pay for the efficiency, one would simply strive for the maximum, responding always with a carriage return. But efficiency isn't free. If the dictionary is not large enough to take advantage of the density you hope to achieve, more space is used than necessary.
(It's something like packaging soap in economy sized boxes: if you don't fill them the result takes up more room than necessary.) As a rule of thumb, respond with 2 or 3 for tables with up to 1000 entries. You will probably develop a feel for what number is appropriate; you might even experiment, loading the same table with different numbers of levels and seeing which number leaves the most space (as reported by messages concerning free space given before and after the change table is loaded). If you set the number too low (say 0 or 1 for over 500 entries) the time it takes to find each change will increase considerably. The next question the program asks is: Minimum length of words to check: [1] Here you should indicate the number of characters in the language's shortest well-formed words. Next you are asked: Auto-corrections file: [none] If you do not want any auto-corrections, simply type a carriage return. If you have an auto-corrections table to load, respond with the appropriate file name. If you do not have an auto-corrections file and you expect to put automatic corrections into such a file, go no further! Use ^C to get back to the monitor and create a file (using a text editor). (The structure of this file is described below in section 4.7.) Run SYLCOR again, and when you get to this point, respond with its name. The auto-corrections will then be added to this file. After an auto-correction file is loaded, you are told how many corrections were loaded. Next comes the question: Query before any auto-corrections? [y] If you respond with "n" or "N", auto-corrections are carried out automatically, without asking you to verify them. The only evidence you will see of a change is the incrementing of the "auto-corr" counter on the screen. If you answer "y", "Y" or simply respond with a carriage return, then each time an auto-correction is discovered the surrounding text is displayed in the upper part of the screen and your approval is sought. 
For example:

    "ther" > "their" ? [y]

This forces you to decide case-by-case whether a change is appropriate. You will probably want to be queried for auto-corrections at first; if you find that you always answer positively, then you may feel comfortable about dispensing with the warning.

After the auto-corrections query you will again be informed about how much free memory is available. Next you are asked for an exceptions file:

    Exceptions file 1:

If you respond to this with a carriage return, SYLCOR will assume you do not want to use any exceptions files. As with the auto-correction file, any exceptions files you wish to use must be created before reaching this point. They need not have any entries, but you must respond to this prompt with the name of a previously-created file.

An exceptions file is simply a list of words, all lower case. Its order is not significant. It is good to have it begin with an identifying comment line. (If this line begins with a backslash ("\") then the exceptions file can be periodically sorted and the comment line will stay at the top.)

After Exceptions file 1 has loaded, you will be told how many exceptions were loaded and informed about the amount of free storage by a message such as the following:

    12963 bytes left, largest space is 6568 bytes.

If either of the numbers gets below 100, SYLCOR may have problems adding to the exceptions lists or auto-corrections table.

If you have loaded an exceptions file, you will be asked for another:

    Exceptions file 2:

You can use up to four exceptions files during any run of SYLCOR. This allows you to keep, for example, Biblical names in one file, linguistic jargon in another, unassimilated loan words in another, and so on. Some applications will not need all of the exceptions files; for instance, correcting Scripture would not need the linguistic jargon and correcting a linguistic paper would not need the Biblical names.
When the question appears again, press <RETURN> if you have no more exceptions files.

Next you are asked:

    Character which separates ONC distribution classes: [\]

Next you are asked for an "ONC" table:

    ONC file?

The ONC file defines the possible syllable onsets, nuclei and codae. You must write an ONC file for the language to which you are applying SYLCOR; how to do so is discussed in the previous section on SYLCHK, section 3.4.

Next you are asked:

    Orthography change table file: [none]

If you do not want to use a change table, simply press <RETURN>. A change table allows you to normalize the spelling of words before they are checked, which may be very useful. For example, in the practical orthography for Quechua, long vowels are represented as two vowels, for example, long /a/ is represented as "aa". However, in the phonological system, long vowels pattern as a vowel followed by a consonant, so long /a/ patterns as an /a/ followed by a consonant [length]. In order that SYLCOR treat long vowels in this way, the words are normalized by changing "aa" to "a:"; "ee" to "e:"; and so on, and ":" is listed as a coda in the ONC file. The format of an orthography change table is described in detail in the preceding section on SYLCHK, section 3.5.

The next question is:

    Standard format marker field file: (<RETURN> for all fields)

This file will list the specific standard format fields that you want SYLCOR to read. See the preceding section on WRDCHG, section 2.4, for details of how this file should look. Simply press <RETURN> if you want SYLCOR to read all fields.

Remember that all the answers to these questions can be put in a setup file as discussed already.

At this point you will see a message on the screen and finally, you are asked for an input file:

    Input file:

To this you must respond with the name of the file you wish to correct. If the file is not found, you will be asked to try again.
When the file is found, you are asked for the name of the output file. SYLCOR makes a default file name which you can use by simply typing a carriage return; the default writes to the default device and gives the input file name the extension .SPL. Thus, if you were editing FUNNEY.SFM, the next prompt would be:

    Output file: [FUNNEY.SPL]

Of course, you are free to respond with whatever file name you wish. (On a two-tape system, you will definitely want to have the input and output files on different tapes, as otherwise there will be considerable tape spinning in the course of correcting a file.)

4.3 Screen layout

Suppose you initiate a session with SYLCOR in which you are correcting a file TEXT.SFM from the default device and putting the corrected version onto a specified device (such as DD1: or b:) under the name TEXT.SPL (where the new extension indicates that it has gone through a spelling corrector). The following appears slightly above the middle of the screen in reverse video:

    +-----------------------------------------------------------------+
    |SYLCOR TEXT.SFM > DD1:TEXT.SPL 0 words 0 Errors 0 Auto-corr 0 Exc|
    +-----------------------------------------------------------------+

The region above these lines is for the display of text. The region below is the area in which all your interactions with SYLCOR are displayed, that is, prompts and your responses, as well as word editing.

As words pass through SYLCOR, the appropriate counts will be incremented. If you finish working on TEXT.SFM and correct another file, the new file names will be displayed and the counters will be reset to zero. Every time a word passes from the input to the output file, the "words" counter gets incremented. If the word is phonologically anomalous, but is already on an exceptions list, the "Exc" counter is incremented. If it is phonologically anomalous but there is an auto-correction for it, the "Auto-corr" counter is incremented.
If it is phonologically anomalous and there is neither an exception nor an auto-correction for it, then the "Error" counter is incremented. 4.4 Handling possible errors: word edit mode When SYLCOR suspects an error, you are put into "word edit mode." The word is displayed in reverse video in the top part of the screen with surrounding text. The following line appears just below the middle of the screen: WORD EDIT: <-,->, DEL, CTRL/U, CTRL/R, RETRN when done, ? for help Below this appears the word you are editing, with the cursor positioned directly after it. You may now edit this word. Any character you type will be entered to the left of the cursor, except for the following, which have the effect indicated: <- or CTRL/B moves the cursor back (to the left) one character -> or CTRL/F moves the cursor forward (to the right) one character DELETE or BACKSPACE deletes one character to the left of the cursor CTRL/U or CTRL/W deletes the entire word being edited CTRL/R restores the original word, undoing all the editing RETURN closes the editing on this word ? prints this message If you hold down one of the arrow keys, it will move left or right until you release the key. If you are at the end of the word and move right, the cursor will cycle around to the beginning of the word. If you are at the beginning of the word and move left, the cursor will cycle around to the end of the word. When you have finished editing a word, press the carriage return. If you have changed the word, the original word and the corrected form are displayed as a change, and you are asked if you want to make this change automatic (by adding this to the list of automatic changes). For example, if you have changed "yeild" to "yield", the following is displayed: "yeild" > "yield" ? [n] The "[n]" at the end of this line specifies the default value; if you respond simply with a carriage return, the change will not be added to the auto-corrections. 
If you want to add this correction, respond with "y" or "Y". After you respond to this question, the program again resumes searching for the next possible error.

Suppose that, instead of correcting the word, you want to leave it just as it is. To do so, simply respond with a carriage return. The word will then be unchanged, and you will be asked if you want to add it to one of the exceptions lists. For example, if you have two exceptions files, LOANS.LST and BIBNAM.LST (for loans and Biblical names, respectively), you will see the following:

    Add "xxxxx" to exceptions file?
    1 - loan.lst
    2 - bibnam.lst
    <RETURN> to not save this exception
    Type 1, 2, or <RETURN>

To this you must respond with a "1", in which case the word will be added to LOANS.LST; a "2", in which case it is added to BIBNAM.LST; or a carriage return, in which case it is not added to any exceptions list, and the program resumes searching for the next possible error. (The program will complain about any other response.)

4.5 Making the auto-correction and exception files

When you finish correcting a text file, and the output file has been written, you are then asked if you would like to protect the additions made to the auto-correction and exceptions files. Only the files to which there have been additions will be considered. You are asked:

    Update auto-corr & all exceptions files to their current names? [n]

If you respond with "y" or "Y", all files to which there have been additions are updated under the same name and onto the device from which they were read. Since this involves copying the original file and then writing out the additions, this can take considerable time on a tape based system. If you respond negatively you are given the option to do so file by file.
You will see a prompt like the following:

    For auto-corr file NJAUTO.TAB
    1 - save both new and old auto-corrections
    2 - save only new auto-corrections
    <RETURN> to forget new auto-corrections
    Type 1, 2, or <RETURN>:

This gives you the option to (1) rewrite the file with the additions (which, again, takes a while on a tape-based system), (2) write out a temporary backup file consisting of only the entries you have added since your last update, or (3) do nothing about backing up additions. The second alternative takes less time, but in the event of a problem (e.g., a power failure) you must later do a separate operation to append the additions to your original file.

If you are making many additions to the auto-corrections and exceptions files, SYLCOR may ask you to protect these additions before it gets to the end of the text file you are correcting. This is because SYLCOR has a limited ability to keep track of all the new additions. When it gets to the limit, it wants you to rewrite the file with the additions (i.e., option 1, above) so that it can start afresh remembering new additions. (Note: option 2 above will not do here, as it does not cause SYLCOR to "forget" the old additions and start a new list.)

4.6 Ending a session with SYLCOR

SYLCOR begins the process of terminating a session when you respond with a carriage return to the following prompt:

    Next input file (<RETURN> if no more):

Since you may have done only temporary backup to this point, and would now like to do a full backup, you are again asked

    Update auto-corr & all exceptions files to their current names? [n]

When the matter of backup is settled, you are asked to replace the system tape or disk if necessary and then type a carriage return before control returns to the operating system:

    Reinsert system disk if necessary, then press <RETURN>:

You will then be returned to the system prompt.
4.7 Writing your own auto-correction and exceptions files

It was said above that you must create the files used to hold auto-corrections and exceptions before you run SYLCOR, but that when you create them, you need not put in any entries. If you know beforehand some words you wish to include in these files, you might as well put them in with your editor. Here we discuss the syntax of the auto-corrections and exceptions files.

An auto-correction file has the same syntax as an orthography change table (as defined in section 3.5). Each line should contain at most one correction. The match string comes first on the line, followed by the substitution. Both are surrounded by double quotes. Anything on a line outside the quotes is ignored. Any line beginning with any printing character besides a double quote is a comment line and is ignored. Do not use upper case characters (except, perhaps, in comments)! It is good to start it with an identifying comment line. It can be sorted periodically with a line sort, and it can be used with WRDCHG.

An auto-correction file does not need to have anything in it. Auto-corrections can be added to it by using it as the auto-corrections file of a SYLCOR session. Thus, you can start an auto-corrections file simply by creating (with an editor) an empty file or a file which simply contains an identifying comment line. Then you can add all the corrections in SYLCOR sessions.

An exceptions file contains words, one per line, with no quote marks or blanks. Any line beginning with a non-alphabetic character is ignored and may be used for comments. Again, do not use upper case characters (except, perhaps, in comments)!

5. A SPELLING CORRECTION WITH TABLE LOOKUP (SPLCOR)

SPLCOR is a program for correcting potential misspellings and typographical errors in text. SPLCOR may be applied to many texts in one session: for each input text file, a corresponding output file will be created.
SPLCOR deals only with the words of the text, and deals with them only one at a time; all format marking, punctuation and capitalization are passed unchanged from the input text to the output text. It treats every word as a potential error unless the word has been previously entered into an "exception" list. It is possible to have up to four exceptions lists; for example, you might have a list of loan words, a list of abbreviations, a list of Biblical names, etc. SPLCOR allows real errors to be corrected. Since context is sometimes needed to determine what the correct word should be, a region of text surrounding the error is displayed. For example, if you were to encounter the misspelling "ther" out of context, you would not know whether it should be corrected to "their", "there", "other", "the", etc. For many errors, you will simply want to correct the error and continue on through the text. For common errors though, you may want to have all subsequent instances corrected automatically. For example, one might want all instances of "recieve" to become "receive" automatically. SPLCOR allows you to create (in the process of correcting text) a list of automatic changes. You may choose to approve each automatic correction before it modifies the text or to have it applied without your approval. When a session with SPLCOR is initiated, the files containing exceptions and auto-corrections are loaded. At the end of each text corrected, you can refresh the tape or disk copies of these files. In this way, they are enlarged by each session, so you do less and less work in subsequent sessions. A variant of SPLCOR, called SYLCOR, detects potential errors on the basis of phonological well-formedness. It is expected that in the future other spelling correctors will be available which use the SPLCOR shell but have other error detection methods. If you have entered (in the process of correcting text) a certain word, it will be passed as acceptable. 
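The table lookup described above can be sketched in a few lines. The following Python fragment is an illustration only, not SPLCOR's actual code (the programs are written in C). It assumes the file syntax of section 4.7, that lookup is done on the lower-cased word, and that exceptions are consulted before auto-corrections; the function names and the set of extra alphabetic characters (~ and ') are hypothetical choices made for the example.

```python
import re

# A change line: optional blanks, then two quoted strings with anything between.
PAIR = re.compile(r'^\s*"([^"]*)"[^"]*"([^"]*)"')
# Words: standard letters plus the extra alphabetics ~ and ' (an assumption).
WORD = re.compile(r"[A-Za-z~']+")


def load_corrections(text):
    """Read an auto-corrections file; lines not starting with a quote are comment."""
    table = {}
    for line in text.splitlines():
        found = PAIR.match(line)
        if found:
            table[found.group(1)] = found.group(2)
    return table


def load_exceptions(text):
    """Read an exceptions file: one word per line, other lines are comment."""
    words = set()
    for line in text.splitlines():
        if line[:1].isalpha():
            words.add(line.strip())
    return words


def classify(text, exceptions, corrections):
    """Label each word as an exception, an auto-correction, or a possible error."""
    result = []
    for word in WORD.findall(text):
        key = word.lower()  # tables hold lower case only
        if key in exceptions:
            kind = "exception"
        elif key in corrections:
            kind = "auto-corr"
        else:
            kind = "error"
        result.append((word, kind))
    return result


corrections = load_corrections('AUTO.TAB sample\n"recieve" > "receive"\n'
                               '"ther" "their" common slip\n')
exceptions = load_exceptions("\\ loan words\ncaballo\n")
print(classify("Recieve el caballo ther", exceptions, corrections))
```

Because every word not found in a table is reported, the lists grow quickly during the first sessions and the number of words presented for editing falls accordingly.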
For the details of running SPLCOR, see the documentation of SYLCOR (section 4). Ignore all references to the orthography normalization and ONC tables. All other aspects of SPLCOR are exactly as in SYLCOR.

6. HYPHENATION (HYPHEN)

6.1 Introduction

Discretionary hyphens are symbols in a text file that indicate places where word hyphenation at the end of a line is allowed. Just as in English we have rules about where words can be divided, vernacular languages do also. Having these symbols in a text while working with it would be a nuisance, so the HYPHEN program can be used to put them in just prior to printing or typesetting.

The discretionary hyphen character is read by the formatting program Manuscripter (MS) and signals that the word could be hyphenated there if it occurs at the end of a line when printing takes place. This feature is especially helpful in languages that contain many long words. If hyphenation were not allowed, a lot of space would be wasted at the end of each line of print.

The HYPHEN program is basically language independent. The user defines which segments or sequences of segments constitute a given syllabification class and then defines the hyphenation rules in terms of these classes. The user also defines which character sequences constitute overstrike units.

In Spanish, for example, the class of consonants contains the segments b, l, and r and the sequences br and bl. The class of vowels contains the segments a, á, and i and the sequences ai and ái. One hyphenation rule in Spanish is VCV becomes V-CV. Thus the sequence abri would be hyphenated as a-bri.

The program also allows the user to specify where in the word hyphenation is to begin and end. Thus one can tell it to not start hyphenating until there are at least 4 characters at the beginning and to stop hyphenating when there are 3 characters left at the end. This would override any hyphenation rules that might apply near the word boundaries.
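The Spanish example can be worked through with a short sketch. The Python fragment below is only an illustration and is not the HYPHEN source (which is written in C). It assumes that the longest defined sequence is matched first, that each rule is applied repeatedly to the string of class letters, and that a hyphen is kept only when at least min_before characters precede it and at least min_after follow it; that reading of the begin and end limits, like the names segment and hyphenate, is an assumption made for the example.

```python
# Class membership for the Spanish example: unit -> one-letter class.
SPANISH = {"b": "C", "l": "C", "r": "C", "br": "C", "bl": "C",
           "a": "V", "á": "V", "i": "V", "ai": "V", "ái": "V"}
RULES = [("VCV", "V-CV")]


def segment(word, classes):
    """Split word into defined units, always taking the longest possible one."""
    units = sorted(classes, key=len, reverse=True)
    pieces, i = [], 0
    while i < len(word):
        for unit in units:
            if word.startswith(unit, i):
                pieces.append(unit)
                i += len(unit)
                break
        else:
            raise ValueError("undefined sequence %r in %r" % (word[i:], word))
    return pieces


def hyphenate(word, classes, rules, min_before=0, min_after=0, max_passes=1000):
    """Insert discretionary hyphens according to class-based rules."""
    pieces = segment(word, classes)
    pattern = "".join(classes[p] for p in pieces)
    for match, subst in rules:
        passes = 0
        while match in pattern:  # apply each rule until it no longer matches
            if passes >= max_passes:
                raise RuntimeError("rule %r does not terminate" % match)
            pattern = pattern.replace(match, subst, 1)
            passes += 1
    out, seen, rest = [], 0, iter(pieces)
    for symbol in pattern:
        if symbol == "-":
            # Drop hyphens that fall too close to either word boundary.
            if seen >= min_before and len(word) - seen >= min_after:
                out.append("-")
        else:
            unit = next(rest)  # one class letter per unit
            out.append(unit)
            seen += len(unit)
    return "".join(out)


print(hyphenate("abri", SPANISH, RULES))                                 # a-bri
print(hyphenate("abribla", SPANISH, RULES, min_before=2, min_after=3))  # abri-bla
```

The rules operate on the class letters rather than on the spelling itself, which is why a single rule such as VCV becomes V-CV covers both abri and any other word with the same class pattern.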
HYPHEN also allows one to specify to which standard format fields the hyphenation process is to apply. In a dictionary, then, one can have separate classes and rules for the source language fields (such as \w and \i) and for the target language fields (such as \d and \t).

If HYPHEN finds a word that has any sequence that has not been defined, it will display an error message on the screen. This message will show what the sequence is, what the word is, and will also state that the word will not be hyphenated.

6.2 Data files

HYPHEN uses four user-defined data files which need to be created with a text editor before running the program.

6.2.1 Segment definition file

This file contains the information about which segments and/or sequences belong to which classes. The information is to be entered in a specified format.

1. All text up to the first occurrence of the word CLASS (or class) at the beginning of a line is considered to be comment.

2. The word CLASS (or class) at the beginning of a line indicates that a new class is about to be defined. The one letter abbreviation for the class should follow the key word CLASS. Any other text after that will be considered comment.

3. From the next line to either the end of the file or to the next occurrence of the word CLASS at the beginning of a line, all characters are considered to be either segments or sequences that belong to that class.

4. Please note that no one unique sequence can belong to more than one class. Thus "a" cannot both belong to the class A and the class V.

5. Also note that HYPHEN will always take the longest possible sequence and assign its associated class to it. As an example, let's suppose that the following classes are defined:

    CLASS V
    a ai i
    CLASS C
    n r t tr
    CLASS M
    ain

Then the word "train" will be treated as a "CM" pattern and the word "trait" would be treated as a "CVC" pattern.

The following shows an example from Campa Pajonal (a language of the Peruvian jungle).
(Note that the front slash (/) and double quote (") preceding a vowel, as well as the tilde (~) before an n, represent overstrikes, which are discussed in section 6.2.2.)

    Campa Pajonal segment definition file   hab 17-May-85
    CLASS V   Vowels
    a e i o u aa ee ii oo uu ae oe
    /a /e /i /o /u "u
    CLASS C   Consonants
    c ch g j jy m my n ~n p py qu qy r ry s sh t th ts ty tz v vy y
    CLASS N   Word medial nasal consonant clusters
    mp nqu nth ntz mpy nqy nts nc nt nty nch

6.2.2 Overstrike unit file

This file lists the character sequences that constitute overstrike units. That is, it lists all sequences that will be printed as one character when the text is passed through a Consistent Changes print table. HYPHEN uses this information to count correctly where to begin or end hyphenating a word. The sequences are to be entered in the following format.

1. The first line is treated as comment.

2. All following text is considered to be a list of the overstrike units. Units should be separated by white space (i.e., a space, a tab, or a new line). Capitals and lower case letters do not need to be distinguished.

The following shows an example from Spanish.

    Overstrike definition file for Spanish   05-Jul-85 hab
    'a 'e 'i 'o 'u "u ~n

6.2.3 Hyphenation change table

This file contains the hyphenation rules. It is to be written in the form of a "change table," although it differs from a Consistent Changes table in several ways. A change table is a list of paired strings, each string bounded by double quotes ("). The first string of a pair is called the "match string"; it specifies some pattern to be matched. The second string, called the "substitution string," specifies what is to be substituted for each occurrence of the match string.

Please note the following when writing a hyphenation change table:

1. Any character(s) may be placed between the left and right strings. This allows whatever notation you like to symbolize the change.
The following lines have the same effect:

    "VCV" becomes "V-CV"
    "VCV" --> "V-CV"
    "VCV" > "V-CV"
    "VCV" "V-CV"

2. Anything following the right string is ignored, so comments may follow the pair of strings.

3. If a character other than space or tab appears on a line before the first double quote mark, then that line is regarded as a comment, and any change on that line will not be applied. This provides a simple mechanism for disabling a change: simply put some character ahead of the first string. For example, the following line would not make any change:

    off "VCV" > "V-CV"

4. The hyphenation rules are ordered, and each is applied as many times as possible. That is, the first change in the table is made until it cannot be made any more. Then the second change is made, and so on. This feature has great advantages, but can cause problems if not properly used. It is possible to create an infinite loop with this table! Consider the following changes, where C is the class of consonants, V is the class of vowels, and G is the class of the single segment glottal.

    "CCC" > "Cc-C"
    "CC" > "C-C"
    "VG" > "Vg-"

   Note the order of the changes. If the double consonant change were put first, the triple consonant change would never apply (CCC would become C-CC and then C-C-C). Note that the first change converts the second C to a lower case c. This is so that after CCC becomes Cc-C, the second rule will not then convert the CC to C-C. Also note that the same "trick" is applied in the VG change. Without it, we would have an infinite loop: VG would become VG-, which then becomes VG--, and so on.

5. The special symbol # indicates a word boundary. Thus "#CV" indicates word-initial CV and "CV#" indicates word-final CV.

Please note the following special restrictions on the above:

1. There must be a one-to-one correspondence between the non-hyphen characters in the match string and those in the substitution string.
Thus the following will produce unpredictable results:

    "AI" > "V"     (too few characters in substitution string)
    "C" > "TR"     (too many characters in substitution string)

2. When word boundary conditions are indicated in the match string, the substitution string should also include the word boundary symbol (#):

    "#VCV" > "#V-CV"
    "VCV#" > "VC-V#"

6.2.4 Standard format marker field file

This file allows the user to specify which standard format fields (in a text containing several fields) are to be hyphenated. Merely list the format markers of the fields to which the hyphenation rules are to apply. They can be entered in any way. Any text that is not preceded by a backslash character (\) is considered to be a comment. The following could be an example for a dictionary:

    Pajonl.sfm   Campa Pajonal std format marker field file
    \w words
    \i illustrative sentences

Please note that this file is optional. If no file is specified when the program is run, all fields will be used.

6.3 Running the program

When HYPHEN is first run, it begins by indicating the amount of free memory available with a message like:

    HYPHENATION Version 1.3 (12-Dec-86)
    SETUP-ALLOC-112832 bytes for records

You are then asked to specify which non-alphabetic characters (i.e., anything other than a-z) are to be treated as parts of words.

    Press <RETURN> to include these as alphabetic characters: ~'
    Otherwise type the characters desired:

If, for example, you were using ' for accent, ~n for an enyee, and "u for a dieresis u, you would want to type:

    '"~

and then press the <RETURN> key. HYPHEN will then inform you of the characters it will treat as alphabetic (i.e., those used in forming a word). Any other characters will be considered to be punctuation. For example, if you used the example above, the following will be displayed:

    Using the following as alphabetics:
    '"~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

It now asks a series of three questions about how to hyphenate.
The first is:

    Discretionary hyphen character: [&]

Type the character you wish to use for the discretionary hyphen and press the <RETURN> key. If you just press the <RETURN> key, an ampersand (&) is used. Note that you will also have to inform Manuscripter of the character you use for your discretionary hyphen symbol (with the .dh command). Note that a two-character sequence may also be used (e.g., [- as used by SIL's Printing Arts Department in Dallas for typesetting).

The second question is:

    Hyphenation starts after this many characters: [2]

Enter the minimum number of characters that must precede the first hyphenation point in a word and press the <RETURN> key. If you just press the <RETURN> key, hyphenation begins after at least two characters.

The third question is:

    Hyphenation stops at this many characters from the end: [2]

Enter the number desired and press the <RETURN> key. If you just press the <RETURN> key, 2 characters is assumed.

Now it asks for the files that you have created as discussed in section 6.2. The first one is:

    Segment definition file:

Enter the name of your file and press the <RETURN> key. Secondly, it asks:

    Overstrike unit file: (<RETURN> for no overstrike units)

If you have a file specifying which character sequences constitute one printing segment, enter its name and press the <RETURN> key. If there are no such sequences, merely press the <RETURN> key. Note that if you have overstrike characters in your text but do not specify them here, HYPHEN may not correctly delete discretionary hyphens too near the front or too near the end of a word.

Then it will ask:

    Hyphenation change table:

Enter the file name of your change table and press the <RETURN> key. After it has loaded the file, it will display how many changes it found.
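To make the loading step concrete, the following Python sketch reads a change table in the format of section 6.2.3 and counts the changes it finds. It is an assumption-laden illustration, not the way HYPHEN itself is coded: the name parse_change_table is invented, and rejecting a pair whose strings differ in their number of non-hyphen characters is a choice made here (the manual says only that such pairs give unpredictable results).

```python
import re

# A change is a line whose first non-blank character is a double quote:
# "match" <anything without quotes> "substitution" <ignored comment>
PAIR = re.compile(r'^[ \t]*"([^"]*)"[^"]*"([^"]*)"')


def parse_change_table(text):
    """Return the ordered list of (match, substitution) pairs found in text."""
    changes = []
    for line in text.splitlines():
        found = PAIR.match(line)
        if found is None:
            continue  # header, comment, blank line, or disabled change
        match, subst = found.groups()
        # Assumed check: both strings need the same number of non-hyphen characters.
        if len(match.replace("-", "")) != len(subst.replace("-", "")):
            raise ValueError(f"unequal strings in change: {line.strip()}")
        changes.append((match, subst))
    return changes


table = '''Spanish hyphenation rules
"VCV" > "V-CV"
"CCC" becomes "Cc-C"   triple consonants before double ones
off "CC" > "C-C"
"VV" "V-V"
'''
changes = parse_change_table(table)
print(len(changes), "changes found")  # 3 changes found
```

The line beginning with off is skipped because a character other than space or tab precedes its first double quote, so only three changes are counted.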
It will now ask:

    Standard format marker field file: (<RETURN> for all fields)

If you have a file specifying which standard format fields are to be hyphenated, enter its name and press the <RETURN> key. If you want to hyphenate the entire text, merely press the <RETURN> key.

It now asks:

    Input file:

Enter the name of the text file you wish to be hyphenated and press the <RETURN> key. Then it asks:

    Output file: [xxxxxx.hyp]

where xxxxxx represents the name given for the input text file. Enter the name you want for the hyphenated file and press the <RETURN> key. If you just press the <RETURN> key, HYPHEN will write your file on the default device using an extension of .hyp.

After it has processed the file, it will display the number of words it processed and then ask:

    Next input file: (<RETURN> if no more)

Enter the name of any additional file to be hyphenated or press the <RETURN> key.

6.4 Examples

Following are three examples from Peruvian languages. The explanation of the rules, the hyphenation change table, and the segment definition file are shown for each.

6.4.1 Spanish

6.4.1.1 Hyphenation rules. These are from The New World SPANISH-ENGLISH and ENGLISH-SPANISH Dictionary, edited by Salvatore Ramondino (Signet Books, 1969), pp. 553-554.

Consonants

1. ch, ll, rr count as single letters and are never separated:

    pe-cho   o-lla   pe-rro

2. Single consonants between vowels go with the second vowel:

    ca-be-za   pa-re-cer

3. The groups pr, pl, br, bl, fr, fl, tr, dr, cr, cl, gr, gl go with the following vowel and are never separated:

    re-pri-mir   co-pla   te-cla

4. In other groups of two consonants, whether identical or different, the consonants are divided between the preceding and the following vowel:

    res-pi-ro   hon-ra   ac-ción   in-no-ble   at-las

5. In groups of three consonants, the first two go with the preceding vowel and the third with the following vowel:

    ins-tin-to   obs-tá-cu-lo

   Exception: the groups listed in 3 above are not separated:

    en-tre   com-pra   tem-plo   ins-tru-men-to

Vowels

6. In any combination of two of a, e, or o, the syllable is divided between the two vowels:

    ca-o-ba   i-de-a-ción

7. In any combination of two vowels in which one is a, e, or o and the other is i or u, and there is no accent mark on the i or u, the vowels form a diphthong and are not separated:

    jo-fai-na   vian-da   em-bau-car   men-guan-te   vi-rrei-na   con-tien-da
    en-deu-dar-se   con-sue-lo   co-loi-dal   na-cio-nal   duo-de-no

   If there is an accent mark on the a, e, or o of the group, the two vowels still form a diphthong and are not separated:

    es-táis   es-co-géis   cuán-do

   If the accent mark falls on the i or u of the group, the two vowels do not form a diphthong and are separated:

    ca-í-da   pen-sa-rí-a-mos   a-ta-úd   re-ú-ne

8. In any combination of i and u, that is, ui or iu, no division of syllables is made between these two vowels. This holds whether there is an accent mark or not:

    ciu-dad   rui-do   ca-suís-ti-co

9. In any combination of three vowels in which the first one is i, u, or ü (more than three do not occur), there is no division of syllables between any two vowels of the group. This holds whether there is an accent mark on any of the vowels or not:

    a-pre-ciáis

These rules can be simplified to the following hyphenation rules and segment definitions.

6.4.1.2 Segment definition file. Table 1 shows the segment definition file needed for Spanish.

    Spanish segment definition file   hab/sp 08-Jul-85
    This data is from The New World SPANISH-ENGLISH and ENGLISH-SPANISH
    dictionary, ed. by Salvatore Ramondino, 1969, pp. 553-4
    (V. Division of Syllables in Spanish).
    CLASS C   Consonants
    b bl br c ch cl cr d dr f fl fr g gl gr h j k l ll m n ~n
    p pl pr qu r rr s t tr v x z y
    CLASS V   Vowels
    a e i o u 'a 'e 'i 'o 'u
    ai ia ei ie oi io ui iu "u'e
    au ua eu ue ou uo "ue "ui "u'i
    'ai i'a 'ei i'e 'oi i'o 'ui i'u
    'au u'a 'eu u'e 'ou u'o 'iu u'i
    i'ai i'ei u'ai u'ei "u'ei
    uia ui'a uio ui'o uie ui'e

    Table 1 - Spanish segments

6.4.1.3 Overstrike unit file. Table 2 shows the overstrike unit file needed for Spanish. An accented vowel is preceded by a single quote ('), a dieresis on a u is indicated by a double quote ("), and an enyee is indicated by a tilde (~n).

    Overstrike definition file for Spanish   05-Jul-85 hab
    'a 'e 'i 'o 'u "u ~n

    Table 2 - Spanish overstrikes

6.4.1.4 Hyphenation change table. Table 3 shows the hyphenation change table needed for Spanish.

    Spanish hyphenation rules   hab 17-May-85
    "VCV" > "V-CV"
    "CCC" > "Cc-C"
    "CC" > "C-C"
    "VV" > "V-V"

    Table 3 - Spanish hyphenation rules

6.4.2 Amarakaeri

This is a Peruvian jungle language which belongs to the Harakmbet language family.

6.4.2.1 Hyphenation rules. Amarakaeri has the following hyphenation rules (as provided by Bob Tripp):

1. When a sequence of vowel-consonant-vowel occurs, a break may be made following the first vowel, except when the consonant is d, g, or y.

    ya-ti-huad

   When a vowel is followed by a glottal, the break is made after the glottal.

    o'-hua'-po

   When a sequence of vowel-consonant-glottal-vowel occurs, the break is made between the consonant and the glottal.

    mo'-en-'uy-ne   on'-haudiay-'uya-te

2. A break may be made between two consonants.

    arat-but   yan-nig-pee'

   The digraph hu should not be broken.

    hua-hue'   pak-hue'   huey-pa   jo-nan-hua-hua-hue'

   When a glottal occurs between two consonants, the break should be made after the glottal.

    On'-ka'-a-po   on'-no-kie'-uy   on'-tia-huay-po

3. A break may be made between two vowels.

    o'-e-a-po   hua-e'-e-ri

   However, the vowel clusters oe, oe, ee, ae, ia, ie, io, io should not be broken.
    no-poe'-dik   on'-no-po'-toe-po   tia-huay-hued   be-tio-ka'

   When a cluster of three vowels occurs, break following the second vowel.

    a'-nig-pei-a'-po   mo'-ma-noe-an-hua-hui-ka'-a-po-ne

   In any vowel cluster including a glottal, a break may be made after the glottal.

    ij-no-poe-a'-a-po'i   hua-e'-e-ri   aro'-en

4. Do not hyphenate leaving a single letter at the beginning or end of a word.

6.4.2.2 Segment definition file. Table 4 shows the segment definition file needed for Amarakaeri. An underscored vowel is indicated by a closing brace (}) preceding the vowel.

    Amarakaeri segment definition file   hab 15-May-85
    CLASS C   Consonants
    b c f h hu j k l m n p q r s t v w x z
    CLASS X   Exception consonants
    d g y
    CLASS G   Glottal
    '
    CLASS V   Vowels
    a e i o u }a }e }i }o }u
    }o}e }e}e }a}e ia }i}e io }i}o oe

    Table 4 - Amarakaeri segments

6.4.2.3 Overstrike unit file. Table 5 shows the overstrike unit file needed for Amarakaeri. An underscored vowel is indicated by a closing brace (}) preceding the vowel.

    Overstrike definition file for Amarakaeri   06-Jul-85 hab
    }a }e }i }o }u

    Table 5 - Amarakaeri overstrikes

6.4.2.4 Hyphenation change table. Table 6 shows the hyphenation change table needed for Amarakaeri.

    Amarakaeri Hyphenation Change Table   hab 15-May-85
    "VCV" > "V-CV"
    "VCGV" > "VC-GV"
    "VXGV" > "VX-GV"
    "CC" > "C-C"
    "XC" > "X-C"
    "CX" > "C-X"
    "XX" > "X-X"
    "CGC" > "CG-C"
    "CGX" > "CG-X"
    "XGC" > "XG-C"
    "XGX" > "XG-X"
    "VGV" > "Vg-V"
    "VG" > "Vg-"
    "VVV" > "Vv-V"
    "VVGV" > "Vvg-V"
    "VV" > "V-V"

    Table 6 - Amarakaeri hyphenation rules

6.4.3 Campa Pajonal

Campa Pajonal is a Peruvian jungle language which belongs to the Arawakan language family.

6.4.3.1 Hyphenation rules. These rules were provided by Allene Heitzman.

1. The vowels are a, e, i, and o; length, written as a geminate vowel; and the vowel clusters ae and oe.

2. The consonants are: c, ch, g, j, jy, m, my, n, ñ, p, py, qu, qy, r, ry, s, sh, t, th, ts, ty, tz, v, vy, y.

3. The consonant clusters are (word medial only): mp, mpy, nc, nch, nqu, nqy, nt, nth, nts, nty, ntz.

4. Break after any vowel preceding a consonant, except before an m or n in a consonant cluster.

5. Do not break off less than four letters.

6.4.3.2 Segment definition file. Table 7 shows the segment definition file needed for Campa Pajonal.

    Campa Pajonal segment definition file   hab 17-May-85
    CLASS V   Vowels
    a e i o aa ee ii oo ae oe
    /a /e /i /o
    CLASS C   Consonants
    c ch g j jy m my n ~n p py qu qy r ry s sh t th ts ty tz v vy y
    CLASS N   Word medial nasal consonant clusters
    mp nqu nth ntz mpy nqy nts nc nt nty nch

    Table 7 - Campa Pajonal segments

6.4.3.3 Overstrike unit file. Table 8 shows the overstrike unit file needed for Campa Pajonal. An accented vowel is preceded by a single slash (/), and an enyee is indicated by a tilde (~n).

    Overstrike definition file for Campa Pajonal   06-Jul-85 hab
    /a /e /i /o ~n

    Table 8 - Campa Pajonal overstrikes

6.4.3.4 Hyphenation change table. Table 9 shows the hyphenation change table needed for Campa Pajonal.

    Campa Pajonal hyphenation rules   hab 17-May-85
    "VC" > "V-C"
    c break after any vowel preceding a consonant
    c do not break if it is an m or n in a consonant cluster

    Table 9 - Campa Pajonal hyphenation rules

6.5 Miscellaneous

6.5.1 Program limitations

While HYPHEN is quite general, it does have some limitations.

1. If a text has a mixture of vernacular and loan words, HYPHEN will try to hyphenate the loan words according to the rules of the vernacular. If the loan word contains some undefined sequence, then HYPHEN will ring the terminal bell, display an error message for the word, and not hyphenate it. (This is actually a fundamental problem of identifying loan words within a text.)

2. As of version 1.2, HYPHEN correctly handles a text containing Manuscripter bar commands (such as |b or |u).
Earlier versions treated the b or u as part of the word to be hyphenated and lost any capitalization of a word preceded by a bar command.

3. HYPHEN assumes that the orthography consists only of lowercase alphabetics. Thus it is not able to tell the difference between upper and lower case letters, even if, say, capital letters were used to represent unvoiced vowels. Both will be treated as if they were lower case. For HYPHEN to handle this situation correctly, one will need to represent the unvoiced sound by some other unique sequence.

6.5.2 Testing method

The following is a method one can use to test one's segment definition file and hyphenation change table.

1. Create a file that consists of the example words listed in the hyphenation rules. Put each word on a separate line.

2. Then make two copies of each word, each one on a separate line.

3. Place a backslash character in front of the first occurrence and insert hyphens where they should go. HYPHEN will then treat this as a standard format marker and not as a word.

4. Insert a space in front of the second word.

5. Run the file through the HYPHEN program and examine the results. If hyphenation has occurred correctly, the two occurrences of the word will line up exactly.

Here is an example of part of such a test file for Spanish.

    \o-lla
     olla
    \ca-be-za
     cabeza
    \re-pri-mir
     reprimir
    \co-pla
     copla
    \te-cla
     tecla
    \res-pi-ro
     respiro
    \obs-t'a-cu-lo
     obst'aculo

The output would then look like this:

    \o-lla
     o-lla
    \ca-be-za
     ca-be-za
    \re-pri-mir
     re-pri-mir
    \co-pla
     co-pla
    \te-cla
     te-cla
    \res-pi-ro
     res-pi-ro
    \obs-t'a-cu-lo
     obs-t'a-cu-lo

6.5.3 Some change table techniques

One can use the fact that the hyphenation rules are ordered to one's advantage. Consider an example from Ticuna, a Peruvian jungle language. The sequence arj needs to be hyphenated as -arj word finally and a-rj elsewhere (j is a vowel).
The segment definition file includes the following classes:

    TIPHYP.SEG Character classes for Ticuna (Peru) hyphenation
    CLASS V
    e i o u
    CLASS C
    b c ch d f g l m n ~n ng p q s t w y
    CLASS A
    a
    CLASS J
    j
    CLASS R
    r

Notice that a, r, and j are in separate classes by themselves. The hyphenation rules include the following changes:

    TIPHYP.CHG changes for hyphenation of Ticuna (Peru)
    "ARJ#" > "-arj#"
    "A" > "V"
    "J" > "V"
    "R" > "C"
    "VCV" > "V-CV"

Note here that the word final exception is treated first. If the sequence arj is not word final, then the second through fourth changes convert the "ARJ" class sequence into a "VCV" class sequence. This allows the final change to make the correct hyphenation.

7. DELIMITER CHECKING AND NESTING CHECK (DELIM)

7.1 Introduction

Delimiters are symbols used in pairs to enclose specific information. The most common delimiter pair is parentheses. Others are square brackets and curly braces. DELIM tests whether delimiters are paired and properly nested. The user may specify the delimiters to be checked; for example, one may wish to check the following:

    ( )   { }   " "   ` '   [ ]   < >

DELIM reports the errors in such a way that they are easy to find. Multiple files may be checked. DELIM never changes the file that is being checked.

DELIM is useful for the preparation of any text which makes use of delimiters. For example, many linguistic papers have frequent parentheses, phonetic and phonemic bracketing ([] and //), and glosses (`'), all of which must be balanced and properly nested, for example, [atox] /atuq/ `fox'. Sometimes formatting programs (e.g., SCRIBE) and often programming languages (e.g., PTP, C) require heavy use of delimiters. (While errors in these can sometimes be discovered by running the program, it will generally be much quicker to discover the errors with DELIM and correct them before running the program.)
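The idea of a pairing and nesting check can be sketched with a stack of open delimiters. The Python below is an illustration under stated assumptions, not a description of how DELIM is written: the function name check_delimiters and the tuple form of its results are invented, and the handling of a pair whose left and right symbols are identical (such as " " or / /) is a guess at reasonable behavior.

```python
def check_delimiters(lines, lefts='({["', rights=')}]"'):
    """Return (kind, delimiter, line number) for every unpaired delimiter."""
    closer_for = dict(zip(lefts, rights))
    closers = set(rights)
    stack = []   # (left delimiter, line number) pairs still waiting to be closed
    errors = []
    for number, line in enumerate(lines, start=1):
        for ch in line:
            if stack and closer_for[stack[-1][0]] == ch:
                stack.pop()                      # closes the innermost open pair
            elif ch in closer_for:
                stack.append((ch, number))       # opens a new pair
            elif ch in closers:
                errors.append(("unmatched right", ch, number))
    # Anything still open at the end of the file was never closed.
    errors.extend(("unmatched left", ch, number) for ch, number in stack)
    return errors


# The linguistic example from the text, checked with its own three pairs.
print(check_delimiters(["[atox] /atuq/ `fox'"], lefts="[/`", rights="]/'"))  # []
```

Testing for a closing symbol before testing for an opening one is what lets a symmetric pair such as / / work: the second slash closes the first instead of opening another.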
7.2 Running the program

DELIM begins to run with the following message:

    DELIMITER PAIRING AND NESTING CHECK Version 2.1 (12-Dec-86)
    Press <RETURN> to use these delimiters: ({[" )}]"
    Otherwise type delimiter file name:

If you are satisfied with this list of delimiters, simply type a carriage return. Otherwise specify the name of the delimiter file that includes the delimiters you want to check. The form of such a file is discussed in section 7.4.

Next you will be asked for an output file:

    Output file: [con]

If you simply type a carriage return, the output will be put to the terminal. If you wish to have the output printed directly (i.e., without first creating a file on some device), respond with prn (or however you refer to your printer). If you type a file name, the result will be written to that file.

Next, by means of the prompt

    Input file:

you are asked for the file to be checked. Respond with the appropriate file name. When DELIM finishes checking the first file, it asks for another file to be checked:

    Next input file: (<RETURN> if no more)

When there are no more files to be checked, simply type a carriage return to return to the monitor.

7.3 The form of the output

The output file will contain, for each file being checked, its name, the potential errors found in that file, and the number of potential errors found in that file.

There are two sorts of errors. First, there might be a right delimiter for which there was no previous corresponding left delimiter. For example, if a file started with the line

    This is a file ] which has an error.

the error would be reported as follows:

    unmatched right ] on line 1
    This is a file ] which has an error.
                   ^

Second, there might be a left delimiter for which there is no subsequent corresponding right delimiter. If a 15 line file ended with

    This is a file { which has an error.
the error would be reported as follows:

    unmatched left { on line 15
    This is a file {
                   ^

7.4 How to write a delimiter file

To specify delimiters other than the defaults, it is necessary to create a delimiter file. This file contains two or three lines. The optional first line is reserved for comments such as "Delim file for XYZ." The second line should list (without intervening spaces, commas, etc.) all the left delimiters. The third line should list the corresponding right delimiters, with each right delimiter directly below the corresponding left delimiter. For example, the following is an acceptable delimiter file (where there is nothing on lines two and three other than the delimiters, and all lines end with a carriage return):

    This is a DELIM file for XYZ
    [{(
    ]})

Any character can be given as a delimiter, but note that a delimiter can only be a single character. If the last two lines of the delimiter file are not the same length, the program will report this when it runs:

    Delimiter lists are not the same length.

It is possible (and sometimes desirable) to give the delimiters to be checked directly from the terminal. This can be done by giving the terminal device name (tt: for RT-11, con for MS-DOS) in response to the prompt for a delimiter file name, typing the two lines of left and right delimiters, and then closing the file with a ^Z (control Z). For example, to check only the delimiter pairs ( ) and [ ], one could respond to the prompt for a delimiter file with tt: (on RT-11 systems), then type the sequence:

    ( [ <RETURN>
    ) ] <RETURN>
    ^Z

7.5 Program limitations

One actual error sometimes causes DELIM to report many errors (i.e., errors are said to "cascade"). Thus, error messages that follow a real error should sometimes simply be disregarded. Once the real error is fixed, the subsequent (erroneous) error messages go away.
Too many unmatched left delimiters (more than approximately 15) will cause DELIM to terminate with a message beginning "Stack overflow..." If this happens, control is returned to the monitor. Try checking the file with fewer delimiter pairs, or correct what errors you can and rerun the program.

Delimiters cannot span files; that is, corresponding delimiters must be in the same file.

DELIM does not ignore delimiters in comments or in quoted strings.

DELIM can check only 99 pairs of delimiters.
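As a closing illustration, the delimiter file format of section 7.4 and the pair limit above can be expressed as a short Python sketch. The function name read_delimiter_file and the constant MAX_PAIRS are invented for this example, and treating an oversized list as an error is an assumption; the manual does not say how DELIM itself reacts to more than 99 pairs.

```python
MAX_PAIRS = 99  # limit stated in section 7.5


def read_delimiter_file(text):
    """Return (left, right) pairs from a two- or three-line delimiter file."""
    lines = text.splitlines()
    if len(lines) not in (2, 3):
        raise ValueError("A delimiter file contains two or three lines.")
    # With three lines the first is a comment; the last two hold the delimiters.
    lefts, rights = lines[-2], lines[-1]
    if len(lefts) != len(rights):
        raise ValueError("Delimiter lists are not the same length.")
    if len(lefts) > MAX_PAIRS:
        raise ValueError("Too many delimiter pairs.")  # assumed reaction
    return list(zip(lefts, rights))


pairs = read_delimiter_file("This is a DELIM file for XYZ\n[{(\n]})\n")
print(pairs)  # [('[', ']'), ('{', '}'), ('(', ')')]
```

Each right delimiter is paired with the left delimiter in the same column, which mirrors the requirement that it sit directly below its partner in the file.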